feat: add native Parquet file reading support via extension #71

Open
BingqingLyu wants to merge 16 commits into alibaba:main from BingqingLyu:read_parquet_extension

Conversation


@BingqingLyu BingqingLyu commented Mar 17, 2026

What do these changes do?

This PR adds native support for reading Apache Parquet format files in NeuG through a new Parquet extension. Users can now directly load graph data from Parquet files without requiring format conversion, integrating seamlessly with NeuG's existing data import mechanisms.

Related issue number

Fixes #56

Greptile Summary

This PR adds a native Apache Parquet file reading extension to NeuG, enabling users to directly LOAD FROM Parquet files without format conversion. The implementation follows NeuG's existing JSON extension pattern, introducing ArrowParquetOptionsBuilder (extending ArrowOptionsBuilder), a ParquetReadFunction with execFunc/sniffFunc, and an Init()/Name() entry point — all wired into the CMake build system with conditional Arrow Parquet/Snappy/Zlib support and CMake 4.x compatibility patches.

Key issues found:

  • use_embedded_schema option is a no-op: The option is declared in ParquetParseOptions, prominently documented in both parquet_options.h and load_parquet.md, but buildFragmentOptions() never calls parquetOpts.use_embedded_schema.get(options). Passing use_embedded_schema=false to LOAD FROM has absolutely no effect, which will mislead users.
  • TestOptionsTranslation_BufferedStream provides false test coverage: The test for the BUFFERED_STREAM=false path asserts only that reader_properties != nullptr, which is always true. The test body even acknowledges Arrow has no getter for the buffered-stream flag, yet the test is not removed or marked as a limitation, giving false confidence this path is verified.
  • cache_decompressed semantics are misleading: The option is documented as "Cache decompressed column chunks to accelerate repeated reads" but it actually configures Arrow's I/O hole-filling (read coalescing) cache via set_cache_options(). The mapping may also be logically inverted: cache_decompressed=true applies LazyDefaults() (lazy=true, on-demand fetching) rather than the more eager Defaults() (lazy=false).
  • CMake CMP0169 OLD defers migration: Setting CMP0169 OLD suppresses the CMake 4.x error for FetchContent_Populate() but does not migrate to the recommended FetchContent_MakeAvailable() pattern; this should be tracked as a follow-up.

Confidence Score: 3/5

  • The core extension is structurally sound and follows existing patterns, but a documented option (use_embedded_schema) is completely non-functional, which blocks merging without either fixing the implementation or removing the option from the docs and API.
  • The overall architecture is well-designed, mirrors the JSON extension correctly, and includes broad test coverage. However, the use_embedded_schema option being silently non-functional is a clear correctness gap that will confuse users. Combined with the misleading cache_decompressed semantics and a test that provides false coverage for BUFFERED_STREAM=false, the implementation has enough quality issues to warrant revisions before merging.
  • extension/parquet/src/parquet_options.cc requires the most attention — use_embedded_schema must be either implemented or removed, and the cache_decompressed/CacheOptions mapping should be clarified. extension/parquet/tests/parquet_test.cpp also needs the BufferedStream test corrected.

Important Files Changed

  • extension/parquet/src/parquet_options.cc: Core Parquet options builder; has a critical bug where the use_embedded_schema option is declared and documented but never read/applied in buildFragmentOptions(). Also contains a potential semantic inversion in the cache_decompressed logic mapping to Arrow's CacheOptions.
  • extension/parquet/include/parquet_options.h: Header defining ParquetParseOptions and ArrowParquetOptionsBuilder. Well-structured and follows the existing CSV builder pattern, but use_embedded_schema is declared here yet never used in the implementation.
  • extension/parquet/include/parquet_read_function.h: Implements ParquetReadFunction with execFunc and sniffFunc. The pattern mirrors the JSON extension correctly. sniffFunc has proper error handling; execFunc error handling depends on ArrowReader::read() always throwing on failure.
  • extension/parquet/src/parquet_extension.cpp: Minimal extension entry point registering ParquetReadFunction and extension metadata. Correctly mirrors the JSON extension pattern with proper exception handling.
  • extension/parquet/tests/parquet_test.cpp: Comprehensive C++ test suite covering options translation, type mapping, integration, and multi-file handling. The TestOptionsTranslation_BufferedStream test does not actually verify the disabled state and provides false assurance, as Arrow doesn't expose a getter for buffered-stream status.
  • cmake/BuildArrowAsThirdParty.cmake: Extended to conditionally enable Arrow Parquet and compression codec support. Includes a CMake 4.x compatibility patch for FetchContent_Populate() and Arrow's ThirdpartyToolchain.cmake. The CMP0169 OLD policy suppresses an error but defers the real migration to FetchContent_MakeAvailable().
  • tools/python_bind/tests/test_load.py: Adds well-structured Python integration tests for Parquet LOAD FROM, column projection, WHERE filtering, ORDER BY, and COPY FROM. Properly guarded by the NEUG_RUN_PARQUET_TESTS environment variable, following the JSON test pattern.
  • doc/source/extensions/load_parquet.md: User-facing documentation for the Parquet extension. Documents use_embedded_schema as a supported option, but this option currently has no effect in the implementation, which would mislead users.

Sequence Diagram

sequenceDiagram
    participant User as User (Cypher)
    participant Ext as ParquetExtension (Init/Name)
    participant RF as ParquetReadFunction
    participant FSP as LocalFileSystemProvider
    participant OB as ArrowParquetOptionsBuilder
    participant AR as ArrowReader
    participant Arrow as Apache Arrow Dataset API

    User->>Ext: LOAD PARQUET / INSTALL PARQUET
    Ext->>RF: registerFunction(ParquetReadFunction)

    User->>RF: LOAD FROM "file.parquet" RETURN *
    RF->>FSP: provide(fileSchema)
    FSP-->>RF: resolvedPaths + fileSystem

    alt Schema inference (sniffFunc)
        RF->>OB: ArrowParquetOptionsBuilder(state)
        RF->>AR: ArrowReader(state, optionsBuilder, fs)
        AR->>Arrow: inferSchema()
        Arrow-->>AR: Arrow Schema
        AR-->>RF: EntrySchema
    end

    RF->>OB: build()
    OB->>OB: buildFragmentOptions()<br/>(buffered_stream, pre_buffer,<br/>cache_decompressed, row_batch_size)
    OB->>OB: buildFileFormat(scanOptions)<br/>(ParquetFileFormat + default_fragment_scan_options)
    OB-->>RF: ArrowOptions(scanOptions, fileFormat)

    RF->>AR: read(localState, ctx)
    AR->>Arrow: createScanner(fs)
    Arrow-->>AR: Scanner
    AR->>Arrow: full_read / batch_read
    Arrow-->>AR: RecordBatches
    AR-->>RF: execution::Context
    RF-->>User: Result rows

Comments Outside Diff (1)

  1. extension/parquet/tests/parquet_test.cpp, line 1187-1210 (link)

    P1 TestOptionsTranslation_BufferedStream does not actually verify the disabled case

    The test sets BUFFERED_STREAM=false and then its only assertion is:

    EXPECT_NE(parquetFragmentOpts->reader_properties, nullptr)
        << "Extension should configure reader_properties based on BUFFERED_STREAM option";

    This assertion is always true regardless of whether buffered stream is enabled or disabled (the previous test suite already verifies reader_properties is non-null with default options). The comment in the test even acknowledges this: "Arrow doesn't expose a getter for is_buffered_stream_enabled".

    Because parquet::ReaderProperties does not expose a getter for the buffered-stream flag, the only meaningful way to test this branch is either:

    • Verify I/O timing/behavior in an integration test, or
    • Remove the test case since it cannot differentiate between enabled and disabled states.

    As written, this test gives false confidence that the BUFFERED_STREAM=false path is exercised correctly.

Last reviewed commit: 1fb1d56

Greptile also left 4 inline comments on this PR.


Comment on lines +74 to +115
ParquetParseOptions parquetOpts;
ReadOptions readOpts;

// Configure Parquet-specific reader properties
auto reader_properties = std::make_shared<parquet::ReaderProperties>();

// Enable buffered stream if configured
if (parquetOpts.buffered_stream.get(options)) {
  reader_properties->enable_buffered_stream();
}

// Set I/O buffer size in bytes
int64_t buffer_size = readOpts.batch_size.get(options);
reader_properties->set_buffer_size(buffer_size);

fragment_scan_options->reader_properties = reader_properties;

// Configure Arrow-specific reader properties
auto arrow_reader_properties = std::make_shared<parquet::ArrowReaderProperties>();

// Set Arrow row batch size (number of rows per batch)
int64_t row_batch_size = parquetOpts.row_batch_size.get(options);
arrow_reader_properties->set_batch_size(row_batch_size);

// Use threads setting from general read options
arrow_reader_properties->set_use_threads(readOpts.use_threads.get(options));

// Configure pre-buffering for high-latency filesystems
arrow_reader_properties->set_pre_buffer(parquetOpts.pre_buffer.get(options));

// Configure caching of decompressed data
if (parquetOpts.cache_decompressed.get(options)) {
  arrow_reader_properties->set_cache_options(
      arrow::io::CacheOptions::LazyDefaults());
} else {
  arrow_reader_properties->set_cache_options(
      arrow::io::CacheOptions::Defaults());
}

fragment_scan_options->arrow_reader_properties = arrow_reader_properties;

return fragment_scan_options;
Contributor
P1 use_embedded_schema option declared but never read

ParquetParseOptions defines use_embedded_schema (default: true) and it is documented in both parquet_options.h and load_parquet.md as a supported option, but buildFragmentOptions() never calls parquetOpts.use_embedded_schema.get(options). Every other option (buffered_stream, pre_buffer, cache_decompressed, row_batch_size) is consumed, but this one is completely ignored, so passing use_embedded_schema=false to LOAD FROM has no observable effect.

The option should either be wired into the Arrow ParquetFragmentScanOptions / ArrowReaderProperties (e.g., to decide whether to trust the embedded schema vs. infer from data), or removed from the struct, the header doc, and the user-facing documentation.

Collaborator Author

Fixed in ff02825. Removed the use_embedded_schema option entirely — it was declared but never read by buildFragmentOptions(). Arrow's ParquetFileFormat uses the embedded Parquet schema by default and provides no straightforward API to override this at the ParquetFragmentScanOptions level. The option has been removed from ParquetParseOptions, the header doc comment, and load_parquet.md.

Comment on lines +104 to +111
// Configure caching of decompressed data
if (parquetOpts.cache_decompressed.get(options)) {
  arrow_reader_properties->set_cache_options(
      arrow::io::CacheOptions::LazyDefaults());
} else {
  arrow_reader_properties->set_cache_options(
      arrow::io::CacheOptions::Defaults());
}
Contributor
P2 cache_decompressed semantic mismatch with Arrow's CacheOptions

The option is documented as "Cache decompressed column chunks to accelerate repeated reads", but ArrowReaderProperties::set_cache_options() configures Arrow's I/O hole-filling cache (read coalescing), not decompressed column chunk caching.

Additionally, the mapping may be logically inverted: cache_decompressed=true applies CacheOptions::LazyDefaults() (lazy=true), which means the cache only loads bytes that are explicitly requested (i.e., does not pre-fetch). CacheOptions::Defaults() (lazy=false) fetches data more eagerly. If the intent is to "cache aggressively for repeated reads", the non-lazy (Defaults) option would be more appropriate when cache_decompressed=true.

Consider:

  1. Renaming the option to something that more accurately reflects what it controls (e.g., enable_io_coalescing or lazy_io_cache), and updating the documentation accordingly.
  2. Verifying whether lazy=true or lazy=false actually delivers the desired read-performance behaviour for the intended use-case.

Collaborator Author

Fixed in ff02825. Renamed cache_decompressed to enable_io_coalescing to accurately reflect what arrow::io::CacheOptions controls — this is Arrow's I/O read coalescing (hole-filling cache) for non-contiguous byte ranges, not decompressed column chunk caching. The option key is now ENABLE_IO_COALESCING, with updated comments in parquet_options.h, parquet_options.cc, and load_parquet.md. The lazy/eager semantics (LazyDefaults vs Defaults) are preserved as-is, since lazy coalescing (only load explicitly requested ranges) is the natural default when coalescing is enabled.

Comment on lines +48 to +68
static execution::Context execFunc(
    std::shared_ptr<reader::ReadSharedState> state) {
  // Get file system from provider
  LocalFileSystemProvider fsProvider;
  auto fileInfo = fsProvider.provide(state->schema.file);
  state->schema.file.paths = fileInfo.resolvedPaths;

  // Create Parquet-specific options builder
  auto optionsBuilder =
      std::make_unique<reader::ArrowParquetOptionsBuilder>(state);

  // Create Arrow reader with Parquet options
  auto reader = std::make_unique<reader::ArrowReader>(
      state, std::move(optionsBuilder), fileInfo.fileSystem);

  // Execute read operation
  execution::Context ctx;
  auto localState = std::make_shared<reader::ReadLocalState>();
  reader->read(localState, ctx);
  return ctx;
}
Contributor
P2 execFunc silently ignores errors from reader->read()

The read() call on line 66 populates ctx but does not return a status. If ArrowReader::read() internally encounters an Arrow error (e.g., corrupted file, schema mismatch), errors may only surface as logged warnings rather than propagated exceptions, leaving ctx in a partially populated state that the caller cannot distinguish from a successful read.

Comparing with the sniffFunc below (lines 97–101), which explicitly checks sniffResult and throws on failure, execFunc has no such guard. It is worth confirming that ArrowReader::read() always throws on failure (rather than logging and continuing) so that errors are not silently swallowed.

Collaborator Author

Confirmed in ff02825. Added an explicit comment in execFunc to document this. ArrowReader::read() propagates all errors via exceptions (THROW_IO_EXCEPTION / THROW_INVALID_ARGUMENT_EXCEPTION) — see src/utils/reader/reader.cc where every Arrow Result and Status failure path checks .ok() and throws. There is no silent swallowing path.

Comment on lines 117 to 121
# Point Arrow to use the project's RapidJSON
set(RapidJSON_ROOT "${CMAKE_SOURCE_DIR}/third_party/rapidjson" CACHE PATH "" FORCE)
endif()
set(ARROW_WITH_ZLIB OFF CACHE BOOL "" FORCE)
set(ARROW_ENABLE_THREADING ON CACHE BOOL "" FORCE)

Contributor
P2 cmake_policy(SET CMP0169 OLD) suppresses a CMake 4.x deprecation warning but defers the real fix

Setting CMP0169 OLD silences the deprecation of the legacy FetchContent_Populate() pattern, but this is marked as an error in CMake 4.x by default, meaning the workaround must stay in sync with future CMake releases. The FetchContent_MakeAvailable() API (available since CMake 3.14) is the recommended migration path.

This is acceptable as a short-term compatibility fix, but a follow-up to migrate away from FetchContent_Populate() should be tracked, especially since the project already handles CMake 4.x incompatibilities in other parts of this patch.
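For reference, a sketch of the recommended migration (illustrative only; the actual dependency names and URLs in Arrow's ThirdpartyToolchain.cmake will differ, and `somedep` is a placeholder):

```cmake
# Legacy pattern that CMP0169 OLD keeps alive:
#   FetchContent_GetProperties(somedep)
#   if(NOT somedep_POPULATED)
#     FetchContent_Populate(somedep)
#     add_subdirectory(${somedep_SOURCE_DIR} ${somedep_BINARY_DIR})
#   endif()
#
# CMake >= 3.14 replacement: declare once, then a single call that
# populates the content and adds it to the build.
include(FetchContent)
FetchContent_Declare(somedep
  URL https://example.com/somedep.tar.gz)  # placeholder URL
FetchContent_MakeAvailable(somedep)
```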

Collaborator Author

Acknowledged. The cmake_policy(SET CMP0169 OLD) is an intentional short-term compatibility shim to unblock CMake 4.x builds. Migrating the affected FetchContent_Populate() calls to FetchContent_MakeAvailable() is tracked as a follow-up — the current patch keeps the existing FetchContent call sites working without regressions while the proper migration is planned separately.

- Remove use_embedded_schema option (was declared but never read by
  buildFragmentOptions(); Arrow always uses the embedded Parquet schema
  by default and provides no simple API to override this)
- Rename cache_decompressed -> enable_io_coalescing to accurately reflect
  what arrow::io::CacheOptions controls (I/O hole-filling / read
  coalescing, not decompressed column chunk caching); update comments
  and docs accordingly
- Add comment in execFunc clarifying that ArrowReader::read() propagates
  all errors via exceptions, not silent logging
- Remove TestOptionsTranslation_BufferedStream test which could not
  meaningfully verify the disabled case (Arrow exposes no getter for
  is_buffered_stream_enabled); rename TestOptionsTranslation_CacheOptions
  -> TestOptionsTranslation_IoCoalescing and update option keys to match
  the renamed ENABLE_IO_COALESCING option
@BingqingLyu (Collaborator Author)

Addressed all review comments in ff02825:

  1. use_embedded_schema is a no-op (P1): Removed the option from ParquetParseOptions, the header doc comment, and load_parquet.md. Arrow's ParquetFileFormat uses the embedded Parquet schema by default and provides no straightforward ParquetFragmentScanOptions-level API to override this.

  2. cache_decompressed semantic mismatch (P2): Renamed to enable_io_coalescing (key: ENABLE_IO_COALESCING). The option controls Arrow's I/O hole-filling / read coalescing cache — not decompressed column chunk caching. Updated the comment in parquet_options.cc, the doc block in parquet_options.h, and the user-facing load_parquet.md.

  3. execFunc error propagation (P2): Verified that ArrowReader::read() throws on all error paths (THROW_IO_EXCEPTION / THROW_INVALID_ARGUMENT_EXCEPTION in reader.cc). Added an explicit comment in execFunc to make this clear.

  4. CMP0169 OLD CMake shim (P2): Acknowledged as an intentional short-term fix; the FetchContent_Populate() → FetchContent_MakeAvailable() migration is tracked as a follow-up. No code change needed.

  5. TestOptionsTranslation_BufferedStream false coverage: Removed the test — parquet::ReaderProperties exposes no getter for is_buffered_stream_enabled, so the only assertion (reader_properties != nullptr) was always true regardless of the option value. Renamed TestOptionsTranslation_CacheOptions → TestOptionsTranslation_IoCoalescing and updated it to use the new ENABLE_IO_COALESCING option key.

# ============================================================
extension_tests_default:
runs-on: [self-hosted]
container:
Collaborator

  • Should trigger neug-extension-test.yml at least for this PR.
  • We should first run neug-extension-test.yml with image: 0.1.0-nightly. If it works fine, retag it to 0.1.0 and push. cc @lnfjpt


# Configure Arrow Parquet support if parquet extension is enabled
if(BUILD_EXTENSIONS AND "parquet" IN_LIST BUILD_EXTENSIONS)
set(ARROW_PARQUET ON CACHE BOOL "" FORCE)
Collaborator

ARROW_PARQUET itself is a guard in Arrow's CMakeLists, so maybe we should use a different name here, e.g. ARROW_ENABLE_PARQUET.


Development

Successfully merging this pull request may close these issues.

[Feature] Add native support for Apache Parquet file format

2 participants